Abstract

XML is a standard and universal language for representing information.
XML processing is supported by two key frameworks: DOM and SAX. SAX is
efficient, but leaves the developer to encode much of the processing.
This paper introduces a language for expressing XML-based languages via
grammars that can be used to process XML documents and synthesize
arbitrary values. The language is declarative and shields the developer
from SAX implementation details. The language is specified and an
efficient implementation is defined as an abstract machine.


1 Introduction 

XML is a standard and universal language for representing information.
It is used to represent information in areas including financial
trading, the control of robotic telescopes, clinical data and music. An
XML document consists of a tree of elements. Each element contains a
tag, some attribute name-value pairs and a sequence of child elements.
Leaf nodes may be unformatted text.

In order for an XML document to be processed, it must conform to a
predefined format. The format defines a collection of tags that can be
used in the document, the attributes for an element with a given tag
and the rules of parent-child element composition. Such a format
defines a language and any XML document that conforms to the format is
written in the language. If the format is defined to support
information for a specific application domain (such as share prices or
system configuration) then it constitutes a domain specific language
(DSL).


How should an XML document be processed? An application that processes
XML will need to read the document and translate it into some form of
useful information. This is usually achieved in one of two ways:
translate the XML into data that is then processed by the application,
or translate the XML into calls on an application specific API. The
first approach can be thought of as a mapping from one data format to
another and the second as executing the XML document. Sometimes a
mixture of the two is used.

In either case, working with XML involves reading 
a document and processing the information in some 
way. There are two standard ways of processing XML 
data: 

DOM   A DOM processor [5] translates the XML document into a faithful
      in-memory tree and passes this data structure to the application.
      The application can then traverse the tree and perform any
      appropriate actions.

SAX   A SAX framework [7] traverses the XML document in a predefined
      order and generates events for each type of tree-node that it
      encounters. The application supplies the SAX framework with an
      adapter that implements handlers for each event type. The
      handlers perform application specific processing.

Both DOM and SAX processing will achieve the desired result. However,
there are significant drawbacks to the DOM approach since it requires
the complete XML tree to be represented in memory before
application-specific processing can take place. Firstly, the XML
document may be very large so its representation in memory may incur an
unreasonable overhead. Secondly, the DOM approach is not compatible
with applications whose life-cycle may be indefinite, for example
interactive applications.

The SAX approach does not suffer from these drawbacks since the
processing of the XML data is interleaved with application specific
event handlers. Unfortunately, compared to DOM-based processing,
writing a SAX processor is complex since the SAX framework effectively
flattens the XML tree and generates a sequence of events.
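The flattening can be seen with a minimal sketch that uses Python's standard `xml.sax` module rather than the XMF system described in this paper. The handler class `EventLogger`, the helper `sax_events` and the sample document are illustrative assumptions, not part of the original work; the sketch only records the callbacks that a SAX framework issues.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record every SAX callback as a flat tuple (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict so they outlive the callback.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Ignore whitespace-only text between elements.
        if content.strip():
            self.events.append(("text", content.strip()))


def sax_events(xml_text):
    """Return the flat event sequence produced for an XML string."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


doc = '<Class name="C"><Attribute name="x" type="Int"/></Class>'
events = sax_events(doc)
for event in events:
    print(event)
```

The nesting of `Attribute` inside `Class` is no longer explicit in the output: the application has to reconstruct it from the order of start and end events, which is the bookkeeping that the grammar language is intended to hide.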

SAX-based processing of a DSL involves recognizing sequences of events
that arise from a flattened XML document and performing actions that
either synthesize a data structure or make calls on an application API.
This processing is the same as the actions of a parser which takes a
description of a language (a grammar) and processes some input. Given a
suitable representation for XML grammars and an efficient parsing
engine, SAX processing of XML DSLs can be made both convenient and
efficient.

This paper describes an approach to parsing XML grammars using a SAX
framework and shows how a standard LL(1) parsing technique can be used
to process XML documents. The grammar language is novel in that it uses
a convenient syntax in terms of parametric parsing rules and can easily
be implemented using an efficient parsing machine. The language has
been implemented and is available as part of the open-source XMF
system.

The paper is structured as follows: section 2 describes a language for
representing XML grammars; section 3 specifies how the XML grammars
process XML documents and synthesize results; section 4 defines a
parsing machine that is driven by an XML grammar and processes an XML
document as described in the specification; finally, section 5 reviews
the paper and compares the results with similar systems.

2 XML Grammars 

An XML grammar is a collection of rules. The rules specify a set of
legal XML documents; if document d is in the set of legal documents for
grammar g then g is satisfied by d. A grammar also specifies a value
for each XML document. If a document d satisfies grammar g with value v
then parsing d with respect to g produces, or synthesizes, value v.

The XMF system implements a parser for XML grammars. The grammars are
specified in a concrete language described in section 2.1. The
XMF-based grammar language is useful for humans, but long-winded when
describing precisely how the parsing mechanism works. Therefore,
section 2.2 defines an equivalent abstract syntax for the grammar
language that is used in the rest of the paper.

2.1 Example 

XMF implements XML grammars using a language that is based on BNF. A
grammar consists of rules that define non-terminals. The body of a rule
is a pattern that consists of element specifications (terminals), rule
calls (non-terminals), bindings and actions. The following is an
example of an XML grammar that processes a simple model language. The
models consist of packages, classes and associations.

The rest of this section describes the grammar in 
more detail. 

(1)   @Grammar Models
(2)     Attribute ::=
(3)       <Attribute name type/>
(4)       { Attribute(name,type) }.
        Class ::=
          <Class name isAbstract id>
(5)         elements = ClassElement*
(6)       </Class> {
(7)         elements->iterate(e c = Class(name,isAbstract) |
              c.add(e)) }.
(8)     ClassElement ::= Attribute | Operation.
        Operation ::=
          <Operation name>
            as = Arg*
          </Operation> { Operation(name,as) }.
        Package ::=
          <Package name>
            elements = PackageElement*
          </Package> {
            elements->iterate(e p = Package(name) |
              p.add(e)) }.
        PackageElement ::= Package | Class | Assoc.
(9)     Assoc ::=
          <Association name>
            <End n1=name t1=type/>
            <End n2=name t2=type/>
          </Association> {
(10)        Association(n,End(n1,t1),End(n2,t2)) }.
      end


The grammar is defined using the XML grammar DSL defined by XMF and
starts at line (1). Lines (2-4) are a typical example of a grammar
rule. The name of the rule is Attribute. The body specifies that
Attribute expects an XML element representing an attribute with a name
and a type. The variables name and type are bound to the values of the
corresponding XML attributes. Line (4) defines an action that occurs
after the XML element has been consumed. The action constructs an
instance of the XMF class Attribute and supplies the values of name and
type. Each component of a rule-body returns a value. The value of the
last component is that returned by a call of the rule. In this case the
rule returns a new attribute instance.

Line (5) is interesting because it shows a call of the 
rule ClassElement and the use of the * decoration to 
specify that ClassElement should be called repeatedly 
until it fails to be satisfied by the XML input. The 
result of a component decorated with a * is a sequence 
of elements. 

Line (8) is interesting because it shows how alternatives are specified
in a rule. A ClassElement is either an Attribute or an Operation.

Parsing starts with an initial rule and a tree (the root of the
document). Each rule element is processed in turn. Tree elements are
consumed each time an element specification (e.g. line 3) is
encountered in a rule. If the tags of the root element in the tree and
the element specification match then the root is consumed and the parse
proceeds with the child elements. If the comparison ever fails, and no
further choices are available, then the parse fails and no values are
produced.


2.2 Abstract Syntax 

In the rest of this paper we specify a parser for the XML grammar
language and give its implementation. The concrete language described
in the previous section is not really suitable for precise descriptions
of the specification and parsing machinery. Therefore, this section
gives an equivalent abstract syntax description of the essential
features.

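An abstract syntax for the grammar language is used as defined below
where $N$ is a set of names, $E$ is a set of expressions, $\{.\}$ is
the power-set constructor, $[.]$ constructs a set of sequences from an
underlying type, $V$ is a set of values that can be synthesized by a
grammar and $t(P, \ldots)$ denotes the set of all terms with functor
$t$ constructed from the supplied sets $P$ etc.

$$
\begin{array}{rcll}
g \in G & = & \{C\} & \text{grammars} \\
c \in C & = & N \times [N] \times B & \text{clauses} \\
b \in B & = & & \text{clause bodies} \\
 & & or(B, B) & \text{disjunction} \\
 & \mid & and(B, B) & \text{conjunction} \\
 & \mid & bind([N], B) & \text{binding} \\
 & \mid & star(B) & \text{repetition} \\
 & \mid & empty & \text{no elements} \\
 & \mid & any & \text{any element} \\
 & \mid & ok & \text{skip} \\
 & \mid & text & \text{raw text} \\
 & \mid & call(N, [E]) & \text{nonterminal} \\
 & \mid & actions([E]) & \text{synthesis} \\
 & \mid & N \times \{N \times N\} \times \Gamma \times B & \text{element spec} \\
\gamma \in \Gamma & = & E \rightarrow B & \text{guarded bodies} \\
\rho \in \Phi & = & N \rightarrow V & \text{environments} \\
x \in X & = & & \text{XML} \\
 & & N \times \Phi \times [X] & \text{element} \\
 & \mid & text(S) & \text{text}
\end{array}
$$
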

A clause will be written $c(\bar v) \triangleright b$ where $c$ is the
name, $\bar v$ are the arguments and $b$ is the body. A disjunction
will be written $b \mid b'$ and a conjunction $b\,b'$. A call will be
written $n(\bar e)$ and actions $[\bar e]$. Repetition will be written
$b^*$. Bindings will be written $\bar n = b$.

An element specification is $(t, N, \gamma, b)$ which is to be
interpreted as follows: $t$ is a tag and $N$ is a set of names that
specify the attributes to be bound when matching against an XML
element (actually a set of name pairs to allow variables and attribute
names to be different, however we simplify this in definitions by
assuming that they are always the same). The guarded bodies $\gamma$
is a function, viewed as a set of pairs, associating boolean
expressions with clause body elements. The element $b$ is the
else-clause.

Environments $\rho$ are just functions from names to values. They will
be extended in the normal way $\rho[n_i \mapsto v_i]$ and
$\rho \oplus \rho'$ with shadowing on the right. The environment
$\rho \backslash N$ is the same as $\rho$ except that the domain is
restricted to the set of names $N$.
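The three environment operations can be illustrated with a small sketch that models an environment as a Python dictionary. The function names `extend`, `combine` and `restrict` are hypothetical and chosen only for this illustration; they do not come from the XMF implementation.

```python
def extend(env, names, values):
    """rho[n_i -> v_i]: a copy of env with each name bound to its value."""
    return {**env, **dict(zip(names, values))}


def combine(left, right):
    """rho (+) rho': bindings in `right` shadow those in `left`."""
    return {**left, **right}


def restrict(env, names):
    """rho restricted to N: keep only bindings whose name is in `names`."""
    return {name: value for name, value in env.items() if name in names}


outer = {"x": 1, "y": 2}
inner = {"x": 3}
print(combine(outer, inner))   # the binding for x comes from `inner`
print(restrict(outer, {"x"}))  # only x survives the restriction
```

None of the functions mutates its arguments, which mirrors the mathematical reading of environments as functions.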

Sequences of elements are written s and are constructed from the empty
sequence [], concatenation of sequences p + q and consing x : s.

Expressions are used to represent guards in element specifications,
arguments in calls and synthesizing actions. An expression $e$ may
contain variable references and denotes a value $e(\rho)$. Sequences of
actions $\bar e$ generalize naturally.

The Attribute rule body from the example concrete grammar described in
section 2.1 is represented as follows using abstract syntax (and a
suitable action $e_1$):

$(Attribute, \{name, type\}, true, [e_1])$

The Operation rule body is:

$(Operation, \{name\}, true, [as] = Arg()^*\ [e_2])$

2.3 Well Formedness Rules 

Not all syntactically correct grammar rules are meaningful. In order
for a rule to be correct it must conform to variable binding
well-formedness rules that require a variable to be bound before it can
be referenced. For example the following rule is not meaningful because
the use of disjunction means that the variable x cannot be guaranteed
to be bound in all cases:

$W() \triangleright ([x] = X() \mid [y] = Y())\ Z(x)$

The well-formedness rules depend on two functions that are defined on
the abstract syntax. The function $free : B \rightarrow \{N\}$ maps a
rule element to the set of names that are freely referenced in that
element. The function $bound : B \rightarrow \{N\}$ maps a rule element
to the variable names that are bound by the element and subsequently
available once the element has been successfully parsed.

A rule element $b$ is well formed when, given a context of bound names
$N$, the relationship $N \vdash b$ holds as defined in figure 1. A rule
$n(\bar n) \triangleright b$ is well formed when
$\{\bar n\} \vdash b$ and a grammar is well-formed when all of its
rules are well-formed.

Rule Wor defines that names available outside a disjunction must be
bound by both parts of the disjunction. Wand defines that binding is
sequential and cumulative. Wel defines that the names used in element
specification guards must be in scope and that the attributes are
scoped over the guards and the child elements. Wbind defines that a
binding element introduces names that can be used in clause body
elements that occur subsequently. Both Wcall and Wsynth require that
freely referenced names must be bound.

3 Specification 

The XML grammar language is used to specify XML 
languages. A grammar defines a collection of XML 
trees; each tree is a member of the XML language 
defined by the grammar. The association between an 
XML grammar and a set of XML trees is defined as 
a relation of the form: 

$g, b \vdash x + x', \rho, x', v$

where $g$ is the grammar, $b$ is a clause body, $x$ is a sequence of
XML elements, $\rho$ is an environment associating variables with
values, and $v$ is a value. The relation states that an XML document
$d = doc(t, \rho, x)$ satisfies a grammar $g$ with starting rule named
$n$ synthesizing value $v$ when
$g, n() \vdash [(t, \rho, x)], [], [], v$, i.e. calling the rule named
$n$ with no arguments and in an empty variable environment with respect
to the root XML element must consume the complete element and produce a
value.

The relationship is defined in figure 2. Rules Sor1 and Sor2 specify
the conditions under which a disjunction recognises a sequence of XML
trees. Two
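$$\frac{N \vdash b_1 \qquad N \vdash b_2}{N \cup (bound(b_1) \cap bound(b_2)) \vdash b_1 \mid b_2}\ (Wor)$$

$$\frac{N \vdash b_1 \qquad N \cup bound(b_1) \vdash b_2}{N \cup bound(b_1) \cup bound(b_2) \vdash b_1\, b_2}\ (Wand)$$

$$\frac{free(\gamma) \subseteq N \cup N' \qquad N \cup N' \vdash b\ \ \forall b \in ran(\gamma) \qquad N \cup N' \vdash b}{N \vdash (n, N', \gamma, b)}\ (Wel)$$

$$\frac{N \vdash b}{N \cup \bar n \vdash bind(\bar n, b)}\ (Wbind)$$

$$N \vdash empty\ (Wempty) \qquad N \vdash any\ (Wany) \qquad N \vdash text\ (Wtext) \qquad N \vdash ok\ (Wok)$$

$$\frac{free(n(\bar e)) \subseteq N}{N \vdash n(\bar e)}\ (Wcall) \qquad \frac{free(\{\bar e\}) \subseteq N}{N \vdash \{\bar e\}}\ (Wsynth)$$

Figure 1: Well-Formedness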


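$$\frac{g, b \vdash x, \rho, x', v}{g, b \mid b' \vdash x, \rho, x', v}\ (Sor_1) \qquad \frac{g, b \vdash x, \rho, x', v}{g, b' \mid b \vdash x, \rho, x', v}\ (Sor_2)$$

$$\frac{g, b_1 \vdash x, \rho_1, x', v_1 \qquad g, b_2 \vdash x', \rho_2, x'', v_2}{g, b_1\, b_2 \vdash x, \rho_1 \oplus \rho_2, x'', v_2}\ (Sand)$$

$$\frac{g, b \vdash x, \rho, x', \bar v}{g, \bar n = b \vdash x, \rho[n_i \mapsto v_i], x', \bar v}\ (Sbind)$$

$$g, empty \vdash [], \rho, [], null\ (Sempty) \qquad g, any \vdash x : \bar x, \rho, \bar x, x\ (Sany)$$

$$\frac{isText(x)}{g, text \vdash x : \bar x, \rho, \bar x, x}\ (Stext) \qquad g, ok \vdash x, \rho, x, null\ (Sok)$$

$$\frac{g(n) = n(\bar v) \triangleright b \qquad g, b \vdash x, [\bar v \mapsto \bar e(\rho)] \oplus \rho', x', v}{g, n(\bar e) \vdash x, \rho, x', v}\ (Scall)$$

$$\frac{e(\rho) = v}{g, [e] \vdash x, \rho, x, v}\ (Ssynth)$$

$$\frac{g, \gamma(g_i) \vdash \bar x, \rho \oplus (\rho' \backslash N), x', v \qquad g_i(\rho \oplus (\rho' \backslash N))}{g, (t, N, \gamma, b) \vdash (t, \rho', \bar x) : \bar y, \rho, \bar y, v}\ (Sel_1)$$

$$\frac{g, b \vdash \bar x, \rho \oplus (\rho' \backslash N), x', v \qquad \nexists g_i \in dom(\gamma) \bullet g_i(\rho \oplus (\rho' \backslash N))}{g, (t, N, \gamma, b) \vdash (t, \rho', \bar x) : \bar y, \rho, \bar y, v}\ (Sel_2)$$

Figure 2: Specification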
rules are required in order to allow the recognition to succeed if
either of the two patterns succeed. The rule Sand specifies the
relationship between two patterns in sequence. The first pattern
consumes a prefix of the sequence of XML trees and passes the remaining
trees to the second pattern. The two binding environments associated
with the individual patterns are combined with $\oplus$ so that
multiple occurrences of the same variable name shadow on the right.
This rule forces the binding for (x=A)(y=B) to contain a binding for
both x and y. It also forces the environment for (x=A)(x=B) to contain
a single binding for x that is derived from B. The rule Sbind describes
the case in which variables are bound to the result of recognizing a
pattern.

The rule Sempty forces the sequence of XML trees to be empty and
synthesizes the null value. This is to be contrasted with the rule Sany
that consumes a single tree. Empty can be used to force an XML leaf
element: (X, [], [], empty) is a pattern that matches an XML element
with a tag X and with no children. This is to be contrasted with
(X, [], [], any) that matches an XML element with tag X and a single
child element. The pattern (X, [], [], any*) matches a single tree with
a tag X and with any number of children.

The rule Stext recognizes a single XML text element. The rule Scall is
used to call a rule. Each rule may have more than one definition in the
grammar and has 0 or more arguments. The argument values are supplied
at the point of call and are expressions that are evaluated with
respect to the current variable bindings. The associations between the
formal parameters and the actual parameters form the initial
environment for the call. The result of the call is defined by the
value produced by the body of the clause.

The rule Ssynth defines how values are synthesized. An action is a
known function. It is supplied with values that are constructed by
evaluating expressions in the context of an environment. The rule
describes the case where there is a sequence of expressions. This
allows a single pattern to return multiple values as in the following
rules:

$X() \triangleright [v, w] = Y()\ [v + w]$

$Y() \triangleright [10, 20]$

where the rule X binds a pair of values v and w by 
calling Y (which returns a pair of values 10 and 20). 
X terminates by returning the sum of v and w (a 
single value). 

The rule Sel describes how XML elements are processed. An element
pattern involves a tag $t$, some attribute names $N$, some clauses
consisting of a guard and a pattern, and an otherwise pattern. Each
guard is a predicate that may reference variables whose values are
bound in the environment $\rho$. If the next XML element matches the
required tag and the children match a clause-pattern whose guard is
satisfied then the XML element is consumed and the value synthesized by
the clause-pattern is returned.

4 Implementation 

The previous section has specified how XML grammars can be used to
recognize an XML document and to synthesize a value in the process.
However, the specification does not explain how the parsing process
works. The aim of this paper is to explain how a SAX parser can be made
to efficiently parse an XML document with respect to a grammar.

Efficient parsing will be performed by translating 
the grammar into a lookup table that predicts what 
to do based on the next SAX event. Providing that 
the grammar has a specific property that makes each 
lookup deterministic (the LL(1) property) then the 
table and SAX events can be used to drive an efficient 
parsing machine. 

To create the table from an XML grammar, the grammar must be translated
into a normal form. Section 4.1 describes this translation and section
4.2 defines an algorithm that constructs the tables. Finally section
4.3 defines a parsing machine.

4.1 Normal Form 

In order to process the grammar using a parsing engine it is necessary
to lift out all the disjunctions to the top level so that they become
alternative definitions for clauses. The following equivalence is used
to perform the transformation:

$$
G \cup \{ c(\bar m) \triangleright A\,(X \mid Y)\,B \} =
G \cup \left\{ \begin{array}{l}
c(\bar m) \triangleright A\,(\bar n = d(\bar v))\,B \\
d(\bar v) \triangleright X\,[\bar n] \\
d(\bar v) \triangleright Y\,[\bar n]
\end{array} \right\}
$$

where $\bar v = free(X \mid Y)$ and $\bar n = bound(X \mid Y)$. The
idea is that any disjunction $X \mid Y$ makes reference to some
variables $\bar v$ and binds some variables $\bar n$. The disjunction
can be translated to a new clause with two alternative definitions so
long as the referenced variables are passed as arguments and the bound
values are returned as results.

A similar equivalence holds for element patterns:

$$
G \cup \Big\{ c(\bar m) \triangleright A\,(t, N, \bigcup_{i=1,n} g_i \mapsto b_i, b)\,B \Big\} =
G \cup \left\{ \begin{array}{l}
c(\bar m) \triangleright A\,(\bar n = (t, N, \bigcup_{i=1,n} g_i \mapsto n_i(\bar v_i), n(\bar w)))\,B \\
\bigcup_{i=1,n} n_i(\bar v_i) \triangleright b_i\,[\bar w_i] \\
n(\bar w) \triangleright b\,[\bar n]
\end{array} \right\}
$$

The guarded patterns and else-pattern are transformed to calls of new
non-terminals. The free and bound variables are handled in the same way
as disjunction.

Repetition can be removed using the following equivalence:

$$
G \cup \{ c(\bar m) \triangleright A\,X^*\,B \} =
G \cup \left\{ \begin{array}{l}
c(\bar m) \triangleright A\,(d(\bar v))\,B \\
d(\bar v) \triangleright (x = X)(xs = d(\bar v))\,[x : xs] \\
d(\bar v) \triangleright ok
\end{array} \right\}
$$

The equivalences defined above are used left-to-right as rewrite rules
in order to transform XML grammars into a normal form which is suitable
for predictive parsing. The main aim is to get all of the disjunctions
lifted to the top-level of the grammar so that calls can be indexed in
terms of element tags. All the other transformations support this aim
by allowing variable bindings to be passed as arguments in calls and
the results of calls to be bound appropriately.

Consider the following grammar before transformation into normal form:

      @Grammar Test
        A ::= <A> b = (B | C)* </A> {b}.
        B ::= <B n=name/> {n}.
        C ::= <C n=name/> {n}.
      end

and after transformation:

      @Grammar Test
        A ::= b = <A> C1 </A> {b}.
        C1 ::= x = C2 xs = C1 { Cons(x,xs) }.
        C1 ::= { Nil }.
        C2 ::= B.
        C2 ::= C.
        B ::= <B n = name> OK </B> {n}.
        C ::= <C n = name> OK </C> {n}.
      end

4.2 Lookahead Tables


Parsing is performed with respect to lookahead tables. Each clause
defines a lookahead table that maps element tags to sequences of
patterns. The lookahead table is constructed using the following clause
properties:

null     A clause is null if it is satisfied without processing any XML
         elements.

first    The set of first tags associated with a clause. A clause will
         process a sequence of XML elements. The first set of a clause
         contains all tags for the head element of all such sequences.
         If the first sets of a clause with alternative definitions are
         disjoint for each definition then they can be used to predict
         which definition to use.

follow   The set of follow tags associated with a clause. A clause may
         be satisfied by an empty sequence of XML elements. On
         completing the clause, the parse will continue to process a
         sequence of XML elements. The follow set of a clause contains
         all tags for the head element of such sequences, i.e. the XML
         tags that predict no consumption of elements by a clause.

Section 4.2.1 defines the null operation, section 4.2.2 specifies an
algorithm that calculates the first and follow sets of grammar rules
and finally section 4.2.3 shows how tables are constructed and gives an
example.


4.2.1 Definition of Null 

A clause element is null when it can be parsed without consuming any
XML input. Predictive table construction uses the null property to
construct first and follow sets that are used to populate the table for
each grammar rule. The null operation is defined by case analysis on
the elements as follows:


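$$
\begin{array}{lcl}
null(n(\bar e), g) & = & null(b, g),\quad n(\bar v) \triangleright b \in g \\
null(b \mid b', g) & = & null(b, g) \vee null(b', g) \\
null(b\,b', g) & = & null(b, g) \wedge null(b', g) \\
null(\bar n = b, g) & = & null(b, g) \\
null(empty, g) & = & true \\
null(any, g) & = & false \\
null(ok, g) & = & true \\
null(text, g) & = & false \\
null([\bar e], g) & = & true \\
null((t, N, \gamma, b), g) & = & false
\end{array}
$$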
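The case analysis can be transcribed almost directly into code. The following is a minimal sketch, not the XMF implementation: clause bodies are encoded as tagged tuples, a grammar is a dictionary from clause names to lists of alternative bodies, and the names `is_null` and `TEST_GRAMMAR` are invented for the illustration. Two points are assumptions that go beyond the definition above: a call is treated as null when any of its alternative definitions is null, and a recursive call that is already being examined is treated as not null so that the recursion terminates.

```python
def is_null(body, grammar, visiting=frozenset()):
    """Return True when `body` can succeed without consuming XML input."""
    kind = body[0]
    if kind == "call":
        name = body[1]
        if name in visiting:
            # Assumption: cut off cyclic calls instead of looping forever.
            return False
        return any(is_null(b, grammar, visiting | {name})
                   for b in grammar[name])
    if kind == "or":
        return (is_null(body[1], grammar, visiting)
                or is_null(body[2], grammar, visiting))
    if kind == "and":
        return (is_null(body[1], grammar, visiting)
                and is_null(body[2], grammar, visiting))
    if kind == "bind":
        return is_null(body[2], grammar, visiting)
    if kind in ("empty", "ok", "actions"):
        return True
    if kind in ("any", "text", "element"):
        return False
    raise ValueError(f"unknown clause body: {kind}")


# The normal-form Test grammar of section 4.1 in the tuple encoding.
# Element bodies are abbreviated to ("element", tag).
TEST_GRAMMAR = {
    "A": [("element", "A")],
    "C1": [
        ("and", ("bind", ["x"], ("call", "C2")),
                ("and", ("bind", ["xs"], ("call", "C1")), ("actions",))),
        ("actions",),
    ],
    "C2": [("call", "B"), ("call", "C")],
    "B": [("element", "B")],
    "C": [("element", "C")],
}

print(is_null(("call", "C1"), TEST_GRAMMAR))
print(is_null(("call", "C2"), TEST_GRAMMAR))
```

In this encoding C1 is null because its second definition consists only of an action, whereas C2 is not null because both of its definitions must consume an element.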


4.2.2 Calculation of First and Follow Sets 

Calculation of the first and follow sets for the grammar is performed
by the algorithm defined in figure 3. The rest of this section
describes the algorithm.

The sets are calculated in a loop (1-26) that continues until a fixed
point is reached. Each clause in the grammar is processed in turn (2).
If every pattern in a clause named c is null then the clause c is
marked as null (4). For each pattern b in the body of the clause (6),
if the pattern is an element (8) then normal form has ensured that the
element clauses and the else pattern are all calls. Therefore, all of
the clauses called in the body of the element (9) are followed by the
tag t (10). If the prefix B' of the clause body is null (13) then the
clause c is predicted by the first set of b (14). If the element b is a
call and is followed by null patterns (16) then the tags following b
are the same as the tags following c. For all patterns b' that occur
after b in the clause body (19) if b is a call and the intermediate
patterns are null (20) then the tags following b are those that predict
b'.
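The fixed-point loop described above can be sketched as follows. This is an illustrative reading of the algorithm and not the XMF source: a grammar is assumed to be a list of `(name, body)` pairs in normal form, where each body item is `("el", tag, called_clauses)`, `("call", name)` or `("act",)`. The function name `first_follow` and the encoding of an end tag as `"/" + tag` are assumptions made for the example.

```python
def first_follow(grammar):
    """Compute the null, first and follow properties by fixed-point iteration."""
    names = {name for name, _ in grammar}
    null = {n: False for n in names}
    first = {n: set() for n in names}
    follow = {n: set() for n in names}
    changed = True

    def item_null(b):
        return b[0] == "act" or (b[0] == "call" and null[b[1]])

    def item_first(b):
        if b[0] == "el":
            return {b[1]}
        if b[0] == "call":
            return first[b[1]]
        return set()

    def add(target, tags):
        nonlocal changed
        if not tags <= target:
            target |= tags
            changed = True

    while changed:
        changed = False
        for name, body in grammar:
            if all(item_null(b) for b in body) and not null[name]:
                null[name] = True
                changed = True
            for i, b in enumerate(body):
                prefix, suffix = body[:i], body[i + 1:]
                if b[0] == "el":
                    # Clauses called inside an element are followed by its end tag.
                    for child in b[2]:
                        add(follow[child], {"/" + b[1]})
                if all(item_null(p) for p in prefix):
                    add(first[name], item_first(b))
                if b[0] == "call":
                    if all(item_null(s) for s in suffix):
                        add(follow[b[1]], follow[name])
                    for j, later in enumerate(suffix):
                        if all(item_null(s) for s in suffix[:j]):
                            add(follow[b[1]], item_first(later))
    return null, first, follow


# The normal-form Test grammar of section 4.1.
TEST_GRAMMAR = [
    ("A", [("el", "A", ["C1"]), ("act",)]),
    ("C1", [("call", "C2"), ("call", "C1"), ("act",)]),
    ("C1", [("act",)]),
    ("C2", [("call", "B")]),
    ("C2", [("call", "C")]),
    ("B", [("el", "B", []), ("act",)]),
    ("C", [("el", "C", []), ("act",)]),
]

null, first, follow = first_follow(TEST_GRAMMAR)
print(sorted(first["C1"]), sorted(follow["C1"]))
```

For this grammar the sketch reports that C1 starts with a B or C element and is followed by the end tag of A, which is the information needed to choose between the two definitions of C1.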

A grammar is deterministic (or LL(1)) if there is 
at most one choice at any given time. This is an 
important property because it makes parsing efficient 
and relatively simple. Given a situation in which a 
rule is called, if the grammar is deterministic then 
the next element tag (as supplied by the SAX event 
mechanism) determines the grammar rule to be used. 
If the grammar is not deterministic then more SAX 
events have to be consumed in order to decide how 
to proceed or the parsing machinery must support 
backtracking. 


4.2.3 Table Construction 

XML grammars are used to process XML documents using a predictive
parser. The parser processes a lookup table with respect to the grammar
and the next XML element. Each time a clause c is called in the grammar
with respect to an XML element with tag t, the relation predict(c,t) is
used to look up the appropriate clause definition. The prediction
relation is defined in figure 4.

Fortunately, it is easy to check whether an XML 
grammar is deterministic. If the parse table contains 
at most a single entry in each cell, then the grammar 
is LL(1). Only LL(1) grammars are supported by the 
parsing machine defined in the next section. 
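The single-entry check can be sketched as below. The input format is an assumption made for the illustration: each clause definition is summarised as `(clause, label, first_tags, is_null)` and the follow sets are supplied as a dictionary, with end tags written as `"/" + tag`. The names `build_table` and `is_ll1` are hypothetical. Using the follow set to place null definitions in the table is standard LL(1) practice and is assumed here; figure 4 itself only shows the entries derived from first sets.

```python
def build_table(definitions, follow):
    """Map (clause, tag) cells to the list of definitions they predict."""
    table = {}
    for clause, label, first_tags, is_null in definitions:
        tags = set(first_tags)
        if is_null:
            # Assumption: a null definition is predicted by the follow tags.
            tags |= follow.get(clause, set())
        for tag in tags:
            table.setdefault((clause, tag), []).append(label)
    return table


def is_ll1(table):
    """A grammar is LL(1) when no cell holds more than one definition."""
    return all(len(entries) <= 1 for entries in table.values())


# Summary of the normal-form Test grammar of section 4.1.
DEFS = [
    ("A", "A -> <A> C1 </A>", {"A"}, False),
    ("C1", "C1 -> C2 C1", {"B", "C"}, False),
    ("C1", "C1 -> Nil", set(), True),
    ("C2", "C2 -> B", {"B"}, False),
    ("C2", "C2 -> C", {"C"}, False),
    ("B", "B -> <B>", {"B"}, False),
    ("C", "C -> <C>", {"C"}, False),
]
FOLLOW = {"C1": {"/A"}}

table = build_table(DEFS, FOLLOW)
print(is_ll1(table))
print(table[("C1", "/A")])
```

A grammar with two definitions of the same clause that both start with the same tag would place two labels in one cell, and `is_ll1` would then return `False`.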

Figure 5 shows the lookup table corresponding to the example defined in
section 4.1. This table has been produced by calculating the first and
follow sets as defined in figure 3 and then populating the table using
the algorithm in figure 4. Since all cells have at most one entry, the
grammar is LL(1), for example:











(1)   repeat
(2)     for (c(v̄) ▷ B) ∈ G
(3)       if ∀b ∈ B • null(b)
(4)       then null(c) = true
(5)       end
(6)       let B' + {b} + B'' = B
(7)       in case b of
(8)            (t, g_i → n̄, n) do
(9)              for x in n : n̄
(10)               follow(x) = follow(x) ∪ {/t}
(11)             end
(12)          end
(13)          if ∀b ∈ B' • null(b)
(14)          then first(c) = first(c) ∪ first(b)
(15)          end
(16)          if isCall(b) ∧ ∀b ∈ B'' • null(b)
(17)          then follow(b) = follow(b) ∪ follow(c)
(18)          end
(19)          let D + {b'} + E = B''
(20)          in if isCall(b) ∧ ∀b ∈ D • null(b)
(21)             then follow(b) = follow(b) ∪ first(b')
(22)             end
(23)          end
(24)       end
(25)     end
(26)  until not changed

Figure 3: Calculation of First and Follow Sets

(1)   for c(n̄) ▷ B + {b} + B' ∈ G where (∀b ∈ B • null(b)) ∧ (first(b) ≠ ∅)
(2)     for t ∈ first(b)
(3)       predict(c, t) = c(n̄) ▷ B + {b} + B'

Figure 4: Definition of Predict



Figure 5: Predictive Parsing Table

$predict(B, B) = B() \triangleright (B, \{(n, name)\}, ok)\ [n]$

4.3 Parser 

A parse is performed using an engine that processes 
SAX events in the context of a lookup table. The 
engine is defined using a state transition function. 
The states of the engine are defined as follows: 


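$$
\begin{array}{rcll}
\sigma \in \Sigma & = & P \times \Phi \times [V] \times [S] \times D & \text{states} \\
p \in P & = & [B + I] & \text{programs} \\
i \in I & = & & \text{instructions} \\
 & & any(N) & \text{any end} \\
 & \mid & [N] = & \text{bind} \\
 & \mid & /N & \text{tag end} \\
s \in S & = & & \text{SAX events} \\
 & & N \times \Phi & \text{start tag} \\
 & \mid & /N & \text{end tag} \\
 & \mid & text(N) & \text{text} \\
d \in D & = & & \text{dumps} \\
 & & P \times \Phi \times D & \text{call frame} \\
 & \mid & \top & \text{empty}
\end{array}
$$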


A machine state $(p, \rho, v, x, d)$ consists of a program $p$ that is
a sequence of clause elements and machine instructions, an environment
$\rho$ that associates variables that are currently in scope with
values, a stack of values $v$, a sequence of SAX events $x$, and a dump
$d$. The idea is that the program drives the machine. At each
transition the next program element and the current SAX event determine
what happens. The current context is saved on the dump when a grammar
rule is called and then the context is restored when the rule returns.
Values are pushed onto the value stack and, if the process terminates
successfully, then the synthesized value is found at the head of the
stack.

The machine executes with respect to an LL(1) lookup table that is
represented as a function $predict : N \times N \rightarrow C$ mapping
clause names and XML element tags to grammar clauses. Given an initial
call $c(\bar v)$ of a grammar rule, the machine uses a state transition
function to transform a starting state into a terminal state as
follows:

$([c(\bar v)], [], [], [x], \top) \longmapsto^{*} ([], [], [v], [], \top)$


If a terminal state cannot be reached then the parse fails. The
transition function is defined in figure 6. The machine is driven by
case analysis at the head of the program. Rules (1-3) define how a call
is performed. The next SAX event is either a start tag, an end tag or
text. In each case the lookup table is used to determine which rule is
being called (the table cannot be ambiguous and may contain no entry in
which case the parse fails). If the table contains an entry for the SAX
event then the current context is saved on the dump and a new context
is created for the execution of the rule body. Rule (4) shows what
happens when a rule body is exhausted; the saved context is restored.

Rules (5) and (6) show how element specifications are performed. When
an element specification is encountered in the program, a corresponding
SAX event to start an element must be received. In this case, if one of
the guard expressions is true then the corresponding body element is
performed, otherwise the else-clause is performed. In either case, a
tag end instruction is added to the program which will test for the
corresponding end tag SAX event (6).

Rule (7) shows how actions are performed. The empty rule (8) defines
that the children of an XML element can be specified as empty.

The rules governing any are defined in (9-13). If an any element is
encountered when the next SAX event is text then the text is just
ignored. If an any element is encountered when the next SAX event is a
start tag then the corresponding end tag must be consumed, therefore an
any machine instruction is created to ensure these match up (10). Rules
(11-13) define how the any instruction is processed for each type of
SAX event.

Rules (14) and (15) define how binding takes place. When a bind element
is encountered (14) the body element is added to the program along with
a bind instruction. The bind instruction extends the environment with
values in (15).

Finally, text is processed in rule (16). 
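The overall shape of the machine can be illustrated with a much reduced sketch. It is not the XMF engine: guards, text events, rule arguments and the any instruction are omitted, SAX events are supplied as a pre-recorded list of tuples, and the item encoding (`"call"`, `"el"`, `"bind"`, `"act"`, `"ok"`, plus the internal instructions `"end"` and `"setvar"`) and the function name `run` are assumptions made for the example. The lookup table is written by hand for the normal-form Test grammar of section 4.1, with an end tag encoded as `"/" + tag`.

```python
def run(table, start, events):
    """Drive a predictive parse over a list of SAX-like events."""
    program, env, values, dump, pos = [("call", start)], {}, [], [], 0
    while program or dump:
        if not program:
            # Rule body exhausted: restore the caller's context (rule 4).
            program, env = dump.pop()
            continue
        item, program = program[0], program[1:]
        kind = item[0]
        if kind == "call":
            if pos >= len(events):
                raise ValueError("unexpected end of input")
            event = events[pos]
            key = event[1] if event[0] == "start" else "/" + event[1]
            body = table.get((item[1], key))
            if body is None:
                raise ValueError(f"no prediction for {item[1]} on {key}")
            dump.append((program, env))
            program, env = list(body), {}
        elif kind == "el":
            _, tag, attr_vars, children = item
            if pos >= len(events) or events[pos][:2] != ("start", tag):
                raise ValueError(f"expected <{tag}>")
            attrs = events[pos][2]
            env = {**env, **{v: attrs[a] for v, a in attr_vars.items()}}
            program = list(children) + [("end", tag)] + program
            pos += 1
        elif kind == "end":
            if pos >= len(events) or events[pos] != ("end", item[1]):
                raise ValueError(f"expected </{item[1]}>")
            pos += 1
        elif kind == "bind":
            program = [item[2], ("setvar", item[1])] + program
        elif kind == "setvar":
            env = {**env, item[1]: values[-1]}
        elif kind == "act":
            values.append(item[1](env))
        elif kind != "ok":
            raise ValueError(f"unknown item: {kind}")
    if pos != len(events):
        raise ValueError("trailing input")
    return values[-1]


C1_MORE = [("bind", "x", ("call", "C2")), ("bind", "xs", ("call", "C1")),
           ("act", lambda env: [env["x"]] + env["xs"])]
TABLE = {
    ("A", "A"): [("bind", "b", ("el", "A", {}, [("call", "C1")])),
                 ("act", lambda env: env["b"])],
    ("C1", "B"): C1_MORE,
    ("C1", "C"): C1_MORE,
    ("C1", "/A"): [("act", lambda env: [])],
    ("C2", "B"): [("call", "B")],
    ("C2", "C"): [("call", "C")],
    ("B", "B"): [("el", "B", {"n": "name"}, [("ok",)]),
                 ("act", lambda env: env["n"])],
    ("C", "C"): [("el", "C", {"n": "name"}, [("ok",)]),
                 ("act", lambda env: env["n"])],
}

EVENTS = [("start", "A", {}), ("start", "B", {"name": "x"}), ("end", "B"),
          ("start", "C", {"name": "y"}), ("end", "C"), ("end", "A")]
result = run(TABLE, "A", EVENTS)
print(result)
```

As in the machine of figure 6, each call consults the table using only the next event, saves the caller's program and environment on the dump, and leaves the synthesized value at the head of the value stack. A production version would consume events as the SAX framework delivers them instead of reading them from a list.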
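$$
\begin{array}{rlcl}
(1) & (n(\bar e) : p, \rho, v, (t, \rho') : x, d) & \longmapsto & ([b], \bar v \mapsto \rho(\bar e), v, (t, \rho') : x, (p, \rho, d)) \\
 & & & \text{when } predict(n, t) = n(\bar v) \triangleright b \\
(2) & (n(\bar e) : p, \rho, v, /t : x, d) & \longmapsto & ([b], \bar v \mapsto \rho(\bar e), v, /t : x, (p, \rho, d)) \\
 & & & \text{when } predict(n, /t) = n(\bar v) \triangleright b \\
(3) & (n(\bar e) : p, \rho, v, text(t) : x, d) & \longmapsto & ([b], \bar v \mapsto \rho(\bar e), v, text(t) : x, (p, \rho, d)) \\
 & & & \text{when } predict(n, text) = n(\bar v) \triangleright b \\
(4) & ([], \_, v, x, (p, \rho, d)) & \longmapsto & (p, \rho, v, x, d) \\
(5) & ((t, N, \bigcup_{i=1,n} g_i \mapsto b_i, b) : p, \rho, v, (t', \rho') : x, d) & \longmapsto & ([b_i, /t] + p, \rho \oplus \rho', v, x, d) \quad \text{when } t = t' \wedge g_i(\rho) \\
 & & & ([b, /t] + p, \rho \oplus \rho', v, x, d) \quad \text{when } t = t' \\
(6) & (/t : p, \rho, v, /t' : x, d) & \longmapsto & (p, \rho, v, x, d) \quad \text{when } t = t' \\
(7) & ([e] : p, \rho, v, x, d) & \longmapsto & (p, \rho, e(\rho) : v, x, d) \\
(8) & (empty : p, \rho, v, /t : x, d) & \longmapsto & (p, \rho, v, /t : x, d) \quad \text{when } p = /t : p' \\
(9) & (any : p, \rho, v, text(t) : x, d) & \longmapsto & (p, \rho, \bot : v, x, d) \\
(10) & (any : p, \rho, v, (t, \rho') : x, d) & \longmapsto & (any(t) : p, \rho, v, x, d) \\
(11) & (any(t) : p, \rho, v, /t : x, d) & \longmapsto & (p, \rho, \bot : v, x, d) \\
(12) & (any(t) : p, \rho, v, (t', \rho') : x, d) & \longmapsto & (any(t') : any(t) : p, \rho, v, x, d) \\
(13) & (any(t) : p, \rho, v, text(t') : x, d) & \longmapsto & (any(t) : p, \rho, v, x, d) \\
(14) & ((n = b) : p, \rho, v, x, d) & \longmapsto & ([b, n{=}] : p, \rho, v, x, d) \\
(15) & ((n{=}) : p, \rho, w : v, x, d) & \longmapsto & (p, \rho[n \mapsto w], w : v, x, d) \\
(16) & (text : p, \rho, v, text(t) : x, d) & \longmapsto & (p, \rho, t : v, x, d)
\end{array}
$$

Figure 6: Parsing Engine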


5 Analysis 

This paper has specified and implemented a DSL for parsing XML
documents using the SAX event-based interface. The SAX interface is
attractive because it is efficient compared to the DOM interface which
constructs a model of the XML document before processing can start. The
challenge in processing SAX events is how to shield the user from
implementation details. Our approach is to use a DSL that allows XML
languages to be expressed as a standard grammar. This paper has
provided a specification and implementation of this language. The
language has been implemented as part of the XMF language oriented
programming (LOP) system which is open-source and available from [5].
Further details of the
language can be found in 0 . 

In addition, XMF can be used to export the grammars to a Java
implementation of the engine described in this paper. This allows XMF
to be used as a compiler for XML grammars that produces stand-alone XML
parsers. In these cases, the synthesizing actions are allowed to be
Java statements and can be used to make calls on other APIs. This
approach has been used in a commercial context to process UML models
encoded as XML.

Originally, XML based languages were expressed in DTD-format and
latterly in XML schemas. [3] show that these formats can be expressed
using standard technology from formal language theory (i.e. language
grammars). The paper also investigates the properties of these
grammars.

Kiselyov [2] reports a number of XML parsers implemented using
declarative technologies including CL-XML (Common Lisp) [1], XISO
(Scheme), Tony (OCaml) and HaXml (Haskell). As noted in [2] these are
all DOM parsers and therefore suffer from the basic efficiency problems
inherent in DOM.

The parser reported in [2] is implemented using a functional style with
many elegant features. However, it is not a true DSL for XML parsing
since it exposes the underlying implementation mechanisms. The XML
grammar language reported in this paper is implemented using XMF which
allows DSLs to be embedded within other languages.